arXiv:1501.01924v1 [cs.DB] 8 Jan 2015


Less is More: Building Selective Anomaly Ensembles 

with Application to Event Detection in Temporal Graphs 

Shebuti Rayana, Stony Brook University, srayana@cs.stonybrook.edu

Leman Akoglu, Stony Brook University, leman@cs.stonybrook.edu


Abstract 

Ensemble techniques for classification and clustering have
long proven effective, yet anomaly ensembles have been
barely studied. In this work, we tap into this gap and propose
a new ensemble approach for anomaly mining, with application
to event detection in temporal graphs. Our method aims
to combine results from heterogeneous detectors with varying
outputs, and leverage the evidence from multiple sources
to yield better performance. However, trusting all the results
may deteriorate the overall ensemble accuracy, as some
detectors may fall short and provide inaccurate results
depending on the nature of the data in hand. This suggests that
being selective in which results to combine is vital in building
effective ensembles; hence “less is more”.

In this paper we propose SELECT, an ensemble approach
for anomaly mining that employs novel techniques to
automatically and systematically select the results to assemble
in a fully unsupervised fashion. We apply our method to
event detection in temporal graphs, where SELECT successfully
utilizes five base detectors and seven consensus methods
under a unified ensemble framework. We provide extensive
quantitative evaluation of our approach on five real-world
datasets (four with ground truth), including Enron
email communications, New York Times news corpus, and
World Cup 2014 Twitter news feed. Thanks to its selection
mechanism, SELECT yields superior performance compared
to individual detectors alone, the full ensemble (naively
combining all results), and an existing diversity-based ensemble.

1 Introduction 

Ensemble methods utilize multiple algorithms to obtain better
performance than the constituent algorithms alone and
produce more robust results [5]. Thanks to these advantages,
a large body of research has been devoted to ensemble
learning in classification [13, 21, 23, 26] and clustering
[8, 11, 12, 25]. On the other hand, building effective ensembles
for anomaly detection has proven to be a challenging
task [1, 27]. A key challenge is the lack of ground truth,
which makes it hard to measure detector accuracy and to
accordingly select accurate detectors to combine, unlike in
classification. Moreover, there exist no objective or ‘fitness’
functions for anomaly mining, unlike in clustering.

Existing attempts for anomaly ensembles either combine
outcomes from all the constituent detectors [9, 10, 16,
19], or induce diversity among their detectors to increase the
chance that they make independent errors [24, 28]. However,
as our prior work [22] suggests, neither of these strategies
would work well in the presence of inaccurate detectors.
In particular, combining all results, including inaccurate
ones, would deteriorate the overall ensemble performance.
Similarly, diversity-based ensembles would combine inaccurate
results for the sake of diversity.

In this work, we tap into the gap between anomaly mining
and ensemble methods, and propose SELECT, one of the
first selective ensemble approaches for anomaly detection.
As the name implies, the key property of our ensemble is its
selection mechanism, which carefully decides which results
to combine from multiple different methods in the ensemble.
We summarize our contributions as follows.

• We identify and study the problem of building selective 
anomaly ensembles in a fully unsupervised fashion. 

• We propose SELECT, a new ensemble approach for 
anomaly detection, which utilizes not only multiple 
heterogeneous detectors, but also various consensus 
methods under a unified ensemble framework. 

• SELECT employs two novel unsupervised selection
strategies that we design to choose the detector/consensus
results to combine, which render the ensemble
not only more robust but also more accurate
than its non-selective counterpart.

• Our ensemble approach is general and flexible. It 
does not rely on specific data types, and allows other 
detectors and consensus methods to be incorporated. 

We apply our ensemble approach to the event detection
problem in temporal graphs, where SELECT utilizes five
heterogeneous event detection algorithms and seven different
consensus methods. Extensive evaluation on datasets with
ground truth shows that SELECT outperforms the average
individual detector, the full ensemble that naively combines
all results, as well as the diversity-based ensemble in [24].


2 Background and Preliminaries 

2.1 Event Detection Problem Temporal graphs change
dynamically over time, as new nodes and edges arrive
or existing nodes and edges disappear. Many dynamic systems
can be modeled as temporal graphs, such as computer,
trading, transaction, and communication networks.

Event detection in temporal graph data is the task of
finding the points in time at which the graph structure notably
differs from its past. These change points may correspond
to significant events, such as critical state changes,
anomalies, faults, intrusion, etc., depending on the application
domain. Formally, the problem can be stated as follows.
Given a sequence of graphs $\{G_1, G_2, \ldots, G_t, \ldots, G_T\}$;
Find time points $t'$ s.t. $G_{t'}$ differs significantly from $G_{t'-1}$.

2.2 Motivation for Ensembles Several different methods 
have been proposed for the above problem, a survey of which 
is given in [3]. To date, however, there exists no single
method that has been shown to outperform all the others. 
The lack of a winner technique is not a freak occurrence. 
In fact, it is unlikely that a given method could perform 
consistently well on different data of varying nature. Further, 
different techniques may identify different classes or types 
of anomalies depending on their particular formulation. This 
suggests that effectively combining the results from various 
different detection methods (detectors from here onwards) 
could help improve the detection performance. 

2.3 Motivation for Selective Ensembles Ensembles are
expected to outperform their average constituent
detector; however, a naive ensemble that trusts results from
all detectors may not work well. The reason is that some
methods may not be as effective as desired depending on 
the nature of the data in hand, and fail to identify the 
anomalies of interest. As a result, combining accurate 
results with inaccurate ones may deteriorate the overall 
ensemble performance [22]. This suggests that selecting 
which detectors to assemble is a critical aspect of building 
robust ensembles—which implies that “less is more”. 

To illustrate the motivation for (selective) ensemble 
building further, consider the example in Figure 1. The 
rows show the anomaly scores assigned by five different 
detectors to time points in Enron Inc.’s time line. Notice
that the scores are of varying nature and scale, due to
different formulations of the detectors. We observe that the
detectors mostly agree on the events that they detect; e.g., ‘J.
Skilling new CEO’. On the other hand, they assign different
magnitudes of anomalousness to the time points; e.g., the
top-ranked anomaly varies across methods. These suggest that combining
the outcomes could help build improved ranking of the 
anomalies. Next notice the result provided by “Probabilistic 
Approach” which, while identifying one major event also 
detected by other detectors, fails to provide a reliable ranking 
for the rest; e.g., it scores many other time points higher than 



Figure 1: Anomaly scores from five detectors (rows) for the Enron 
Inc. time line. Red bars depict top 20 anomalous time points. 


‘F. Cooper new CEO’. As such, including this detector in the 
ensemble is likely to deteriorate the overall performance. 

In summary, inspired by the success of classification 
and clustering ensembles and driven by the limited work on 
anomaly ensembles, we aim to systematically combine the 
strengths of accurate detectors while alleviating the weaknesses
of the less accurate ones to build selective detection
ensembles for anomaly mining. While we build ensembles 
for the event detection problem in this paper, our approach 
is general and can directly be employed on a collection of 
detection methods for other anomaly mining problems. 

3 SELECT: Selective Ensemble Learning for anomaly 

detECTion — Application to Event Detection 
3.1 Overview Our SELECT approach takes the input data,
in this case a sequence of graphs $\{G_1, \ldots, G_t, \ldots, G_T\}$, and
outputs a rank list $R$ of objects, in this case of time points
$1 \le t \le T$, ranked from most to least anomalous.

The main steps of SELECT are given in Algorithm 1.
Step 1 employs (five) different event detection algorithms as
base detectors of the ensemble. Each detector has a specific
and different measure to score the individual time points by
anomalousness. As such, the ensemble embodies heterogeneous
detectors. As motivated earlier, Step 2 selects a subset
of the detector results to assemble through a proposed selection
strategy. Step 3 then combines the selected results into
a consensus. Besides several different event detection algorithms,
there also exist various different consensus finding
approaches. In the spirit of building ensembles, SELECT also
leverages (seven) different consensus techniques to create
intermediate aggregate results. Similar to Step 2, Step 4 then
selects a subset of the consensus results to assemble. Finally,
Step 5 combines this subset into the final rank list of time
points using inverse rank aggregation (Section 3.3).

Algorithm 1 SELECT 

Input: Data: graph sequence $\{G_1, \ldots, G_t, \ldots, G_T\}$
Output: Rank list of objects (time points) by anomaly 
1: Obtain results from (5) base detectors 
2: Select set E of detectors to assemble 
3: Combine E by (7) consensus techniques 
4: Select set C of consensus results to assemble 
5: Combine C into final rank list 
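
The five steps of Algorithm 1 can be read as a small two-phase pipeline. The Python sketch below is only an illustration under stated assumptions: the function name `select_pipeline` and all of its parameters are hypothetical, each detector and consensus method is assumed to be a callable that returns one score list, and the same selection callable is reused in both phases.

```python
from typing import Callable, List, Sequence

ScoreList = List[float]


def select_pipeline(
    data: Sequence,
    detectors: Sequence[Callable[[Sequence], ScoreList]],
    consensus_methods: Sequence[Callable[[List[ScoreList]], ScoreList]],
    select: Callable[[List[ScoreList]], List[ScoreList]],
    final_combine: Callable[[List[ScoreList]], List[int]],
) -> List[int]:
    """Hypothetical skeleton mirroring Steps 1-5 of Algorithm 1."""
    # Step 1: obtain one score list per base detector.
    detector_results = [detect(data) for detect in detectors]
    # Step 2: select the detector results to assemble.
    chosen_detectors = select(detector_results)
    # Step 3: combine the selected results with every consensus technique.
    consensus_results = [combine(chosen_detectors) for combine in consensus_methods]
    # Step 4: select the consensus results to assemble.
    chosen_consensus = select(consensus_results)
    # Step 5: combine the selected consensus results into the final rank list.
    return final_combine(chosen_consensus)
```

Passing a `select` callable that returns its input unchanged corresponds to the full ensemble of Section 3.4.1; the selection strategies of Section 3.4.2 would be plugged in at the same two places.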


Different from prior works, (i) SELECT is a two-phase
ensemble that not only leverages multiple detectors but also
multiple consensus techniques, and (ii) it employs novel
strategies to carefully select the ensemble components to
assemble without any supervision, which outperform naive (no
selection) and diversity-based selection (Section 4). Moreover,
(iii) SELECT is the first ensemble method for event detection
in temporal graphs, although the same general framework
as presented in Algorithm 1 can be deployed for other
anomaly mining tasks, where the base detectors are replaced
with a set of algorithms for the particular task at hand.

Next we fill in the details on the three main components 
of the proposed SELECT ensemble. In particular, we describe
the base detectors (Section 3.2), consensus techniques
(Section 3.3), and the selection strategies (Section 3.4). 

3.2 Base Detectors There exist various methods for the
event detection problem in temporal graphs [3]. In this work
SELECT employs five base detectors (Algorithm 1, Line 1),
while one can easily expand the ensemble with others: (1)
eigen-behavior based event detection (EBED) from our prior
work [2], (2) probabilistic time series anomaly detection
(PTSAD) we developed recently [22], (3) Streaming Pattern
DIscoveRy in multiple Time-Series (SPIRIT) by Papadimitriou
et al. [20], (4) anomalous subspace based event detection
(ASED) by Lakhina et al. [18], and (5) moving-average
based event detection (MAED). All methods extract
graph-centric features (e.g., degree) for all nodes over time and
detect events in multi-variate time series. We provide brief
descriptions of the methods in Appendix A due to space limit.

3.3 Consensus Finding Our ensemble consists of heterogeneous
detectors. That is, the detectors employ different
anomaly scoring functions and hence their scores may vary 
in range and interpretation (see Figure 1). Unifying these 
various outputs to find a consensus among detectors is an 
essential step toward building an ensemble. 


A number of different consensus finding approaches
have been proposed in the literature, which can be categorized
into two groups: rank based and score based aggregation
methods. Without choosing one over the other, we utilize
seven well-established methods as we describe below.

Rank based consensus. Rank based methods use the
anomaly scores to order the data points (here, time points)
into a rank list. This ranking makes the algorithm outputs
comparable and facilitates combining them. Merging multiple
rank lists into a single ranking is known as rank aggregation,
which has a rich history in theory of social choice and
information retrieval [6]. SELECT employs three rank based
consensus methods. Kemeny-Young [14] is a voting technique
that uses preferential ballot and pair-wise comparison
counts to combine multiple rank lists, in which the detectors
are treated as voters and the points as the candidates they
vote for. Robust Rank Aggregation (RRA) [15] utilizes order
statistics to compute the probability that a given ordering
of ranks for a point across detectors is generated by the null
model where the ranks are sampled from a uniform distribution.
The final ranking is done based on this probability,
where more anomalous points receive a lower probability.
The third approach is based on Inverse Rank aggregation, in
which we score each point by $1/r_i$, where $r_i$ denotes its rank
by detector $i$, and average these scores across detectors, based
on which we sort the points into a final rank list.
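
Inverse Rank aggregation as described above is simple enough to sketch directly. The Python snippet below is a minimal illustration, not the authors' implementation: the function names are hypothetical, rank 1 is assumed to denote the most anomalous point, and ties are assumed to be broken by position in the list.

```python
from typing import List, Sequence, Tuple


def ranks_from_scores(scores: Sequence[float]) -> List[int]:
    """Rank 1 goes to the highest score; ties are broken by position (assumption)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for position, idx in enumerate(order, start=1):
        ranks[idx] = position
    return ranks


def inverse_rank_consensus(
    score_lists: Sequence[Sequence[float]],
) -> Tuple[List[float], List[int]]:
    """Average 1/r_i over detectors and sort points by that average."""
    n_points = len(score_lists[0])
    rank_lists = [ranks_from_scores(s) for s in score_lists]
    consensus = [
        sum(1.0 / ranks[i] for ranks in rank_lists) / len(rank_lists)
        for i in range(n_points)
    ]
    # Final rank list: indices of points from most to least anomalous.
    final_order = sorted(range(n_points), key=lambda i: -consensus[i])
    return consensus, final_order
```

Because only ranks enter the average, detectors with very different score scales contribute equally, which is the property that makes rank based consensus applicable to heterogeneous detectors.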

Score based consensus. Rank-based aggregation provides
a crude ordering of the data points, as it ignores the actual
anomaly scores and their spacing. For instance, quite different
rankings can yield equal performance in binary decision.
Score-based aggregation approaches tackle the calibration
of different anomaly scores and unify them within
a shared range. SELECT employs two score based consensus
methods. Mixture Modeling [10] converts the anomaly
scores into probabilities by modeling them as sampled from
a mixture of exponential (for inliers) and Gaussian (for outliers)
distributions. Unification [16] also converts the scores
into probability estimates through regularization, normalization,
and scaling steps. The probabilities are then comparable
across detectors, which we aggregate by both max and
avg. This yields four score based methods.
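
To make the score based route concrete, the following Python sketch converts raw scores to values in [0, 1] and then aggregates them by avg or max. It is an illustration under explicit assumptions: a simple Gaussian (erf based) scaling is used as a stand-in for the scaling step, the exact regularization and normalization steps of Unification [16] may differ, and the function names are hypothetical.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence


def gaussian_scaling(scores: Sequence[float]) -> List[float]:
    """Map raw scores to [0, 1]; scores at or below the mean map to 0 (assumed scaling)."""
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        return [0.0] * len(scores)
    return [max(0.0, math.erf((s - mu) / (sigma * math.sqrt(2)))) for s in scores]


def aggregate(prob_lists: Sequence[Sequence[float]], how: str = "avg") -> List[float]:
    """Combine per-detector probabilities point by point, by average or maximum."""
    if how == "max":
        return [max(column) for column in zip(*prob_lists)]
    return [sum(column) / len(column) for column in zip(*prob_lists)]
```

Applying both aggregations to the output of each of the two conversion methods gives the four score based consensus results mentioned above.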

3.4 Ensemble Learning Given different base detectors 
and various consensus methods, the final task remains to 
utilize them under a unified ensemble framework. In this 
section, we discuss four different approaches for building 
anomaly ensembles. These approaches differ in whether and 
how they select their ensemble components. 

3.4.1 Full ensemble The full ensemble selects all the detector
results (Step 2 of Alg. 1) and later all the consensus
results (Step 4 of Alg. 1) to aggregate at both phases of SELECT.
As such, it is a naive approach that is prone to obtain
inferior results in the presence of inaccurate detectors.






3.4.2 Selective ensembles As motivated earlier in Section
2.3, carefully selecting which detectors to assemble in Step 2
may help prevent the final ensemble from going astray, given
that some base detectors may fail to reliably identify
the anomalies of interest to a given application. Similarly,
pruning away consensus results that may be noisy in Step 4
could help reach a stronger final consensus. In anomaly mining,
however, it is challenging to identify the components
with inferior results given the lack of ground truth to estimate
their generalization errors externally. In this section,
we present two orthogonal selection strategies that leverage
internal clues across detectors or consensuses and work in
a fully unsupervised fashion: (i) a vertical strategy that exploits
correlations among the results, and (ii) a horizontal
strategy that uses order statistics to filter out far-off results.
Strategy I: Vertical Selection. Our first approach to 
selecting the ensemble components is through correlation 
analysis among the score lists from different methods, based 
on which we successively enhance the ensemble one list at a 
time (hence vertical). The work flow of the vertical selection 
strategy is given in Algorithm 2. 

Given a set of anomaly score lists S, we first unify 
the scores by converting them to probability estimates using 
Unification [16]. Then we average the probability scores 
across lists to construct a target vector, which we treat as 
the “pseudo ground-truth” (Lines 1-6). 

We initialize the ensemble $E$ with the list $l \in S$ that
has the highest weighted Pearson correlation to target. In
computing the correlation, the weights we use for the list
elements are equal to $1/r$, where $r$ is the rank of an element
in target when sorted in descending order, i.e., the more
anomalous elements receive higher weight (Lines 7-11).

Next we sort the remaining lists $S \setminus l$ in descending
order by their correlation to the current “prediction” of the 
ensemble, which is defined as the average probability of lists 
in the ensemble. We test whether adding the top list to the 
ensemble would increase the correlation of the prediction 
to target. If the correlation improves by this addition, we 
update the ensemble and reorder the remaining lists by their 
correlation to the updated prediction, otherwise we discard 
the list. As such, a list gets either included or discarded at 
each iteration until all lists are processed (Lines 12-19). 
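
The vertical selection steps described above can be sketched in Python as follows. This is a hedged illustration rather than the authors' code: the input is assumed to be already converted to probability estimates (the Unification step is not reproduced), the function and variable names are hypothetical, and the same $1/r$ weights are assumed for every correlation computation.

```python
import math
from typing import Dict, List, Sequence


def weighted_pearson(x: Sequence[float], y: Sequence[float], w: Sequence[float]) -> float:
    """Weighted Pearson correlation; returns 0.0 if either input is constant."""
    total = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    cov = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    var_x = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    var_y = sum(wi * (yi - mean_y) ** 2 for wi, yi in zip(w, y))
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def column_mean(lists: Sequence[Sequence[float]]) -> List[float]:
    return [sum(column) / len(column) for column in zip(*lists)]


def vertical_selection(prob_lists: Dict[str, List[float]]) -> List[str]:
    """Return the names of the selected lists, in order of selection."""
    # Pseudo ground truth: average probability across all lists.
    target = column_mean(list(prob_lists.values()))
    # Weights 1/r, where r is the rank of an element in the sorted target.
    order = sorted(range(len(target)), key=lambda i: -target[i])
    weights = [0.0] * len(target)
    for rank, idx in enumerate(order, start=1):
        weights[idx] = 1.0 / rank
    remaining = dict(prob_lists)
    # Initialize with the list most correlated to the target.
    first = max(remaining, key=lambda k: weighted_pearson(remaining[k], target, weights))
    remaining.pop(first)
    selected = [first]
    while remaining:
        prediction = column_mean([prob_lists[k] for k in selected])
        # Candidate: remaining list most correlated to the current prediction.
        cand = max(remaining, key=lambda k: weighted_pearson(remaining[k], prediction, weights))
        trial = column_mean([prob_lists[k] for k in selected + [cand]])
        if weighted_pearson(trial, target, weights) > weighted_pearson(prediction, target, weights):
            selected.append(cand)  # keep the list only if it improves the correlation
        remaining.pop(cand)  # either way, the list is processed exactly once
    return selected
```

Each list is examined once, so the loop runs at most as many iterations as there are lists; a list that does not strictly improve the correlation to the target is discarded.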

Strategy II: Horizontal Selection. We are interested in 
finding time points that are ranked high in a set of accurate 
rank lists (from either base detectors or consensus methods), 
ignoring a (small) fraction of inaccurate rank lists. Thus, we 
also present an element-based (hence horizontal) approach 
for selecting ensemble components. 

Algorithm 2 Vertical Selection
Input: S := set of anomaly score lists
Output: E := ensemble set of selected lists
1: P := ∅
2: /* convert scores to probability estimates */
3: for each s ∈ S do
4:   P := P ∪ Unification(s)
5: end for
6: target := avg(P) /* target vector */
7: r := ranklist after sorting target in descending order
8: E := ∅
9: sort P by weighted Pearson (wP) correlation to target
10: /* in descending order, weights: 1/r */
11: l := fetchFirst(P), E := E ∪ l
12: while P ≠ ∅ do
13:   p := avg(E) /* current prediction of E */
14:   sort P by wP correlation to p /* descending order */
15:   l := fetchFirst(P)
16:   if wP(avg(E ∪ l), target) > wP(p, target) then
17:     E := E ∪ l /* select list */
18:   end if
19: end while
20: return E

To identify the accurate lists, this strategy focuses on the
anomalous elements. It assumes that the normalized ranks
of the anomalies should come from a distribution skewed
toward zero. Based on this, lists in which the anomalies
are not ranked sufficiently high (i.e., have large normalized
ranks) are considered to be inaccurate and voted for being
discarded. The work flow of the horizontal selection strategy
is given in Algorithm 3.

Similar to the vertical strategy we first identify a 
“pseudo ground truth”, in this case a list of anomalies. In particular,
we use Mixture Modeling [10] to convert each score
list in S into a binary list in which outliers are denoted by 1, 
and inliers by 0. We then employ majority voting across lists 
to obtain a final set of target anomalies O (Lines 1-7). 

Given that $S$ contains $m$ lists, we construct a normalized
rank vector $r = [r_{(1)}, \ldots, r_{(m)}]$ for each anomaly $o \in O$,
such that $r_{(1)} \le \ldots \le r_{(m)}$, where $r_{(l)}$ denotes the
rank of $o$ in list $l \in S$ normalized by the total number
of elements in $l$. Following similar ideas to Robust Rank
Aggregation [15], we then compute order statistics based on
these sorted normalized rank lists to identify the lists that
provide statistically large ranks for each anomaly.

Specifically, for each ordered list $l$ in a given $r$, we
compute how probable it is to obtain $\hat{r}_{(l)} \le r_{(l)}$ when the
ranks $\hat{r}$ are generated by a uniform null distribution. We
denote the probability that $\hat{r}_{(l)} \le r_{(l)}$ by $p_{l,m}(r)$. Under
the uniform null model, the probability that $\hat{r}_{(l)}$ is smaller or
equal to $r_{(l)}$ can be expressed as a binomial probability

$$p_{l,m}(r) = \sum_{t=l}^{m} \binom{m}{t} \, r_{(l)}^{t} \, (1 - r_{(l)})^{m-t},$$

since at least $l$ normalized rankings drawn uniformly
from $[0,1]$ must be in the range $[0, r_{(l)}]$.
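
The binomial probability above is a plain tail sum and can be evaluated with the standard library. The sketch below is illustrative only; the function names `binomial_tail` and `order_statistic_pvalues` are hypothetical and not taken from the paper.

```python
from math import comb
from typing import List, Sequence


def binomial_tail(l: int, m: int, r_l: float) -> float:
    """P(at least l of m uniform [0,1] draws fall in [0, r_l])."""
    return sum(comb(m, t) * r_l ** t * (1.0 - r_l) ** (m - t) for t in range(l, m + 1))


def order_statistic_pvalues(normalized_ranks: Sequence[float]) -> List[float]:
    """Sort the normalized ranks of one anomaly and evaluate p_{l,m}(r) for l = 1..m."""
    r = sorted(normalized_ranks)
    m = len(r)
    return [binomial_tail(l, m, r[l - 1]) for l in range(1, m + 1)]
```

For example, with $m = 2$ and $r_{(1)} = 0.5$ the tail sum is $2 \cdot 0.5 \cdot 0.5 + 0.5^2 = 0.75$, i.e., the probability that at least one of two uniform draws is at most 0.5.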






Algorithm 3 Horizontal Selection
Input: S := set of anomaly score lists
Output: E := ensemble set of selected lists
1: M := ∅, R := ∅, F := ∅, E := ∅
2: for each l ∈ S do
3:   /* label score lists with 1 (outliers) & 0 (inliers) */
4:   class := MixtureModel(l), M := M ∪ class
5:   R := R ∪ ranklist(l)
6: end for
7: O := majorityVoting(M) /* target anomalies */
8: [S_sort, pVals] := RobustRankAggregation(R, O)
9: for each o ∈ O do
10:   min_ind := min(pVals(o, :))
11:   F := F ∪ S_sort(o, (min_ind + 1) : end)
12: end for
13: for each l ∈ S do
14:   count := number of occurrences of l in F
15: end for
16: Cluster non-zero counts into two clusters, C_l and C_h
17: E := S \ {s ∈ C_h} /* discard high-count lists */
18: return E


For a sequence of accurate lists that rank the anomalies
at the top, and hence that yield low normalized ranks $r_{(l)}$,
this probability is expected to drop with the ordering, i.e., for
increasing $l \in \{1 \ldots m\}$. An example sequence of $p$ probabilities
(y-axis) is shown in Figure 2 for an anomaly based
on 20 score lists. The lists are sorted by their normalized
ranks of the anomaly on the x-axis. The figure suggests that
the 5 lists at the end of the ordering are likely inaccurate, as
the ranks of the given anomaly in those lists are larger than
what is expected based on the ranks in the other lists.



Figure 2: Normalized rank $r_{(l)}$ vs. probability $p$ that $\hat{r}_{(l)} \le r_{(l)}$,
where $\hat{r}$ are drawn uniformly at random from $[0,1]$.

Based on this intuition, we count the frequency that each
list $l$ is ordered after the list with $\min_{l=1,\ldots,m} p_{l,m}(r)$ among
all the normalized rank lists $r$ of the target anomalies (Lines
8-15). We then group these counts into two clusters¹ and
discard the lists in the cluster with the higher average count
(Lines 16-17). This way we eliminate the lists with larger
counts, but retain the lists that appear inaccurate only a few
times, which may be a result of the inherent uncertainty or
noise in the construction of the target anomaly set.

¹ We cluster the counts by k-means clustering with k = 2, where the
centroids are initialized with the smallest and largest counts, respectively.
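
The counting and clustering steps of the horizontal strategy can be sketched as follows. This Python illustration makes several assumptions that are not in the paper: the target anomalies and their normalized ranks are given as input (the Mixture Modeling and majority voting steps are omitted), all names are hypothetical, and when all non-zero counts are equal no list is discarded.

```python
from math import comb
from typing import Dict, List, Sequence, Tuple


def binomial_tail(l: int, m: int, r_l: float) -> float:
    """P(at least l of m uniform [0,1] draws fall in [0, r_l])."""
    return sum(comb(m, t) * r_l ** t * (1.0 - r_l) ** (m - t) for t in range(l, m + 1))


def two_means_1d(values: Sequence[float], iters: int = 100) -> Tuple[float, float]:
    """1-D k-means with k = 2, centroids initialized at the smallest and largest value."""
    lo, hi = float(min(values)), float(max(values))
    for _ in range(iters):
        low = [v for v in values if abs(v - lo) <= abs(v - hi)]
        high = [v for v in values if abs(v - lo) > abs(v - hi)]
        new_lo = sum(low) / len(low)
        new_hi = sum(high) / len(high) if high else hi
        if (new_lo, new_hi) == (lo, hi):
            break
        lo, hi = new_lo, new_hi
    return lo, hi


def horizontal_selection(norm_ranks: Dict[str, Dict[str, float]]) -> List[str]:
    """norm_ranks maps each target anomaly to {list name: normalized rank}."""
    counts: Dict[str, int] = {}
    all_lists = set()
    for ranks in norm_ranks.values():
        all_lists.update(ranks)
        ordered = sorted(ranks, key=ranks.get)  # lists by increasing normalized rank
        m = len(ordered)
        pvals = [binomial_tail(l, m, ranks[name]) for l, name in enumerate(ordered, start=1)]
        cut = pvals.index(min(pvals))
        # Every list ordered after the minimum-probability position gets one vote.
        for name in ordered[cut + 1:]:
            counts[name] = counts.get(name, 0) + 1
    if not counts:
        return sorted(all_lists)
    low_c, high_c = two_means_1d(list(counts.values()))
    discarded = {n for n, c in counts.items() if abs(c - low_c) > abs(c - high_c)}
    return sorted(all_lists - discarded)
```

In a toy input with four lists where one list ranks every anomaly near the bottom, that list collects the most votes, falls into the high-count cluster, and is dropped, while a list that misses a single anomaly is retained.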


3.4.3 Diversity-based ensemble In classification, two basic
conditions for an ensemble to improve over the constituent
classifiers are that the base classifiers are (i) accurate
(better than random), and (ii) diverse (making uncorrelated
errors) [5, 26]. Achieving better-than-random accuracy
in supervised learning is not hard, and several studies have
shown that ensembles tend to yield better results when there
is a significant diversity among the models [4, 17].

Following on these insights, Schubert et al. proposed a
diversity-based ensemble [24], which is similar to our vertical
selection in Alg. 2. The main distinction is the ascending
ordering in Lines 9 and 14, which yields a diversity-favored,
in contrast to a correlation-favored, selection.²

Unlike classification ensembles, however, it is not realistic
for anomaly ensembles to assume that all the detectors
will be reasonably accurate (i.e., better than random), as
some may fail to spot the (type of) anomalies in the given
data. In the presence of inaccurate detectors, the diversity-based
approach would likely yield inferior results as it is
prone to selecting inaccurate detectors for the sake of diversity.
As we show in our experiments, too much diversity is in
fact bound to limit accuracy for event detection ensembles.

4 Evaluation 

We evaluate our selective ensemble approach on the event 
detection problem using five real-world datasets, both previously
used as well as newly collected by us, including email
communications, news corpora, and social media. For four 
of these datasets we compiled ground truths for the temporal 
anomalies, for which we present quantitative results. We use 
the remaining data for illustrating case studies. 

We compare the performance of SELECT with vertical
selection (SelectV), and horizontal selection (SelectH) to that
of individual detectors, the full ensemble with no selection
(Full), and the diversity-based ensemble (DivE) [24]. This
makes ours one of the few works that quantitatively compares
and contrasts anomaly ensembles at a scale that includes
this many datasets with ground truth.

In a nutshell, our results illustrate that (i) base detectors
do not always all produce accurate results, (ii) the ensemble
approach alleviates the shortcomings of the inaccurate
detectors, (iii) a careful selection of ensemble components
increases the overall performance, and (iv) introducing
noisy results decreases overall ensemble accuracy, where the
diversity-based ensemble is affected the most.

4.1 Dataset Description In the following we describe the 
five real-world temporal graph datasets we used in this work. 
All datasets with ground truth events are made available at 

http://shebuti.com/SelectiveAnomalyEnsemble/.

² There are other differences between our vertical selection (Algorithm
2) and the diversity-based ensemble in [24], such as the construction of the
pseudo ground truth and the choice of weights in correlation computation.









Dataset 1: EnronInc. We use four years (1999-2002) of
Enron email communications. In the temporal graphs, the
nodes represent email addresses and directed edges depict
sent/received relations. The Enron email network contains a total
of 80,884 nodes. We analyze the data at a daily sample rate,
skipping the weekends (700 time points). The ground truth
captures the major events in the company’s history, such as
CEO changes, revenue losses, restatements of earnings, etc.
Dataset 2: RealityMining Reality Mining comprises
communication and proximity data of 97 faculty, students,
and staff at MIT, recorded continuously via pre-installed
software on their mobile devices over 50 weeks [7]. From
the raw data we built sequences of weekly temporal graphs
for three types of relations: voice calls, short messages, and
bluetooth scans. For voice call and short message graphs a
directed edge denotes an incoming/outgoing call or message,
and for bluetooth graphs an edge depicts physical proximity
between two subjects. The ground truth captures semester
breaks, exam and sponsor weeks, and holidays.

Dataset 3: TwitterSecurity We collect tweet samples using
the Twitter Streaming API for four months (May 12-Aug 1,
2014). We filter the tweets containing Department of Homeland
Security keywords related to terrorism or domestic
security.³ After named entity extraction and resolution (including
URLs, hashtags, @mentions), we build entity-entity
co-mention temporal graphs on a daily basis (80 time ticks). We
compile the ground truth to include major world news of
2014, such as the Turkey mine accident, Boko Haram
kidnapping school girls, killings during Yemen raids, etc.
Dataset 4: TwitterWorldCup Our Twitter collection also
spans the World Cup 2014 season (June 12-July 13). This
time, we filter the tweets by popular/official World Cup
hashtags, such as #worldcup, #fifa, #brazil, etc. Similar to
TwitterSecurity, we construct entity-entity co-mention temporal
graphs at a 5-minute sample rate (8640 time points). The
ground truth contains the goals, penalties, and injuries in all
the matches that involve at least one of the renowned teams
(Brazil, Germany, Argentina, Netherlands, Spain, France).
Dataset 5: NYTNews This corpus contains all of the
published articles in New York Times over 7.5 years (Jan
2000-July 2007) (available from
https://catalog.ldc.upenn.edu/LDC2008T19). The named entities (people,
places, organizations) are hand-annotated by human editors.
We construct weekly temporal graphs (390 time points) in
which each node corresponds to a named entity and edges
depict co-mention relations in the articles. The data contains
around 320,000 entities, but no ground truth events.

4.2 Event Detection Performance Next we quantitatively
evaluate the ensemble methods on detection accuracy.
The final result output by each ensemble is a rank list, based 

³ http://www.huffingtonpost.com/2012/02/24/homeland-security-manual_n_1299908.html


Table 1: Accuracy of ensembles for EnronInc. (features: weighted
in-/out-degree). * depicts selected detector/consensus results.

                  Full       DivE       SelectV    SelectH
EBED (win)        0.1313     *          *
PTSAD (win)       0.1462     *
SPIRIT (win)      0.7032     *                     *
ASED (win)        0.5470     *          *          *
MAED (win)        0.6670                           *
EBED (wout)       0.2846     *
PTSAD (wout)      0.2118     *
SPIRIT (wout)     0.4563     *                     *
ASED (wout)       0.0580     *
MAED (wout)       0.7328                *          *

Inverse Rank      * 0.6829   * 0.5660     0.6738   * 0.8291
Kemeny-Young      * 0.4086   * 0.3703   * 0.6586   * 0.6334
RRA               * 0.6178     0.4871     0.5686   * 0.6590
Uni (avg)         * 0.5292   * 0.5511   * 0.6375   * 0.6207
Uni (max)         * 0.3333   * 0.3187     0.4314   * 0.7353
MM (avg)          * 0.7513   * 0.5726   * 0.7663   * 0.7530
MM (max)          * 0.0218   * 0.0218     0.2108     0.0224

Final Ensemble      0.7082     0.6276     0.7125     0.7920



on which we create the precision-recall (PR) plot for a given 
ground truth. We report the area under the PR plot, namely 
average precision, as the measure of accuracy. 
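
For reference, average precision over a rank list can be computed as in the short Python sketch below. This is the standard textbook definition, assumed here to match the measure used in the evaluation; the function name and argument names are hypothetical.

```python
from typing import Iterable, Sequence


def average_precision(ranked_points: Sequence[int], true_events: Iterable[int]) -> float:
    """Mean of precision@k taken at the ranks k where a true event is retrieved."""
    events = set(true_events)
    if not events:
        return 0.0
    hits, total = 0, 0.0
    for k, point in enumerate(ranked_points, start=1):
        if point in events:
            hits += 1
            total += hits / k  # precision at the rank of this hit
    return total / len(events)
```

For a rank list `[3, 1, 2, 0]` with true events `{3, 2}`, the hits occur at ranks 1 and 3, giving (1/1 + 2/3) / 2 = 5/6.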

Table 1 shows the accuracies for all four ensemble 
methods on EnronInc., along with the accuracies of the base
detectors and consensus methods. Notice that some detectors 
yield quite low accuracy (e.g., ASED (wout)) on this dataset. 
Further, MM (max) consensus provides low accuracy across 
ensembles no matter which detector results are combined. 
SELECT ensembles successfully filter out relatively inferior 
results and achieve higher accuracy. We also note that 
DivE yields lower performance than all, including Full. 

To investigate the significance of the selections made
by SELECT ensembles, we compare them to ensembles that
randomly select the same number of components to assemble
at each phase. In Table 2 we report the average and standard
deviation of accuracies achieved by 100 such random
ensembles, denoted by RandE, and the gain achieved by
SelectV and SelectH over their respective random ensembles.

We show the final anomaly scores of the time points
provided by SelectH on EnronInc. for visual analysis in
Figure 3. The figure also depicts the ground truth events by
vertical (red) lines, which we note to align well with the time
points with high scores.

Figure 3: Anomaly scores of time points by SelectH on EnronInc.
align well with ground truth (vertical red lines).




















Table 1 shows results when we use weighted node
in-/out-degree features on the directed Enron graphs to construct
the input time series for the base detectors. As such,
the ensembles utilize 10 components in the first phase. We
also build the ensembles using 20 components where we include
the unweighted in-/out-degree features. Table 3 in Appendix B
gives all the accuracy results and selections made,
a summary of which is provided in Table 2. We notice that
the unweighted graph features are less informative and yield
lower accuracies across detectors on average. This affects
the performance of Full and DivE, where the accuracies drop
significantly. On the other hand, SELECT ensembles are able
to achieve comparable accuracies with increased significance
under the additional noisy input.

Thus far, we used the exact time points of the events to
compute precision and recall. In practice, some time delay
in detecting an event is often tolerable. Therefore, we also
compute the detection accuracy when delay is allowed; e.g.,
for delay 2, detecting an event that occurred at t within time
window [t − 2, t + 2] is counted as accurate. Figure 4 shows
the accuracy for 0 to 5 time point delays (days) for EnronInc.,
where delay 0 is the same as exact detection. We notice that
SELECT ensembles and Full can detect almost all the events
within 5 days before or after each event occurs.
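One way to make the delay-tolerant evaluation concrete is sketched below. The matching rule is an assumption for illustration: walking down the ranking, a time point counts as a hit if it lies within `delay` of a ground-truth event that has not been matched yet, and each event is matched at most once. The function name is hypothetical.

```python
def delay_tolerant_ap(scores, events, delay):
    """Average precision when an event at t may be detected in [t-delay, t+delay].

    scores: anomaly score per time point (higher = more anomalous)
    events: collection of ground-truth event time points
    delay:  tolerated delay in time points (0 = exact detection)
    """
    ranking = sorted(range(len(scores)), key=lambda t: -scores[t])
    unmatched = set(events)
    hits, precision_sum = 0, 0.0
    for rank, t in enumerate(ranking, start=1):
        # Unmatched events inside the tolerance window of time point t.
        near = [e for e in unmatched if abs(e - t) <= delay]
        if near:
            # Match the closest event (ties broken by the earlier event).
            unmatched.remove(min(near, key=lambda e: (abs(e - t), e)))
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(events)


# Toy example (hypothetical data): one event at time point 2.
toy_scores = [0.1, 0.9, 0.3, 0.8, 0.2]
print(delay_tolerant_ap(toy_scores, {2}, delay=0))
print(delay_tolerant_ap(toy_scores, {2}, delay=1))
```

With delay 0 this reduces to the exact-detection average precision.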


Figure 4: EnronInc. average precision vs. detection delay using
(left) 10 components and (right) 20 components.


Next we analyze the results for RealityMining. Similar
to EnronInc., we build the ensembles using both 10 and
20 components for the directed Voice Call and SMS graphs.
Bluetooth graphs are undirected, as they capture (symmetric)
proximity of devices, for which we build ensembles with 10
components using weighted and unweighted degree features.
All the details on detector and consensus accuracies as well
as selections made are given in Appendix B due to space
limit (Table 4 and Table 5 for Voice Call, Table 6 for Bluetooth,
Table 7 and Table 8 for SMS). We provide the summary
of results in Table 2. We note that SELECT ensembles
provide superior results to Full and DivE.

Figure 5 illustrates the accuracy-delay plots which show
that SELECT ensembles for Bluetooth and SMS detect almost
all the events within a week before or after they occur,
while the changes in Voice Call are relatively less reflective
of the changes in the school year calendar.

Finally, we study event detection using Twitter. Table 9
in Appendix B contains accuracy details for detecting world
news on TwitterSecurity, a summary of which is included in
Table 2. Results are in agreement with prior ones, where
SelectH outperforms the other ensembles. This further
becomes evident in Figure 6 (left), where SelectH can detect
all the ground truth events within 3 days delay.

Table 2: Significance of accuracy results compared to random
ensembles with same number of selected components as SELECT.

| Ensemble | Accuracy | Significance |
|---|---|---|
| EnronInc. (10 comp.) (Full: 0.7082, DivE: 0.6276) | | |
| (i) RandE (3/10, 3/7) | 0.4804 (μ) | 0.1757 (σ) |
| SelectV | 0.7125 | = μ + 1.3210σ |
| (ii) RandE (5/10, 6/7) | 0.5509 (μ) | 0.1406 (σ) |
| SelectH | 0.7920 | = μ + 1.7148σ |
| EnronInc. (20 comp.) (Full: 0.5420, DivE: 0.4697) | | |
| (i) RandE (4/20, 2/7) | 0.4047 (μ) | 0.1732 (σ) |
| SelectV | 0.7018 | = μ + 1.7154σ |
| (ii) RandE (15/20, 6/7) | 0.5707 (μ) | 0.0864 (σ) |
| SelectH | 0.7798 | = μ + 2.4201σ |
| RM-VoiceCall (10 comp.) (Full: 0.7302, DivE: 0.8724) | | |
| (i) RandE (2/10, 1/7) | 0.7370 (μ) | 0.1551 (σ) |
| SelectV | 0.8370 | = μ + 0.6447σ |
| (ii) RandE (8/10, 6/7) | 0.7653 (μ) | 0.0714 (σ) |
| SelectH | 0.9045 | = μ + 1.9496σ |
| RM-VoiceCall (20 comp.) (Full: 0.8011, DivE: 0.8335) | | |
| (i) RandE (2/20, 2/7) | 0.7752 (μ) | 0.1494 (σ) |
| SelectV | 0.8847 | = μ + 0.7329σ |
| (ii) RandE (17/20, 6/7) | 0.8187 (μ) | 0.0497 (σ) |
| SelectH | 0.8949 | = μ + 1.5332σ |
| RM-Bluetooth (10 comp.) (Full: 0.8398, DivE: 0.7735) | | |
| (i) RandE (4/10, 1/7) | 0.8269 (μ) | 0.1129 (σ) |
| SelectV | 0.9193 | = μ + 0.8184σ |
| (ii) RandE (8/10, 6/7) | 0.8410 (μ) | 0.0322 (σ) |
| SelectH | 0.8886 | = μ + 1.4783σ |
| RM-SMS (10 comp.) (Full: 0.9092, DivE: 0.8598) | | |
| (i) RandE (4/10, 1/7) | 0.8328 (μ) | 0.0978 (σ) |
| SelectV | 0.9283 | = μ + 0.9765σ |
| (ii) RandE (8/10, 6/7) | 0.8976 (μ) | 0.0620 (σ) |
| SelectH | 0.9217 | = μ + 0.3887σ |
| RM-SMS (20 comp.) (Full: 0.9542, DivE: 0.8749) | | |
| (i) RandE (2/20, 1/7) | 0.7685 (μ) | 0.1521 (σ) |
| SelectV | 0.9294 | = μ + 1.0579σ |
| (ii) RandE (17/20, 5/7) | 0.9217 (μ) | 0.0296 (σ) |
| SelectH | 0.9621 | = μ + 1.3649σ |
| TwitterSecurity (10 comp.) (Full: 0.5200, DivE: 0.4800) | | |
| (i) RandE (4/10, 1/7) | 0.5068 (μ) | 0.0755 (σ) |
| SelectV | 0.5467 | = μ + 0.5285σ |
| (ii) RandE (9/10, 3/7) | 0.5198 (μ) | 0.0538 (σ) |
| SelectH | 0.5867 | = μ + 1.2435σ |

The detection dynamics change when TwitterWorldCup
is analyzed. The events in this data such as goals and injuries
are quite instantaneous (recall the 4 goals in 6 minutes by
Germany against Brazil), where we use a sample rate of 5
minutes. Moreover, such events are likely to be reflected on
Twitter with some delay by social media users. As such, it
is extremely hard for the ensembles to pinpoint the exact time
of the events. As we notice in Figure 6 (right), the
initial accuracies at zero delay are quite low. When delay
is allowed for up to 288 time points (i.e., one day), the
accuracies rise to a reasonable level within half a day of
delay. In addition, all the detector and consensus results
seem to contain signal in this case, so most of them are
selected by the ensembles, hence the comparable accuracies. In
fact, DivE selects all of them and performs the same as Full.

Figure 5: RealityMining average precision vs. detection delay for (left to right) Voice Call (10 comp.), Voice Call (20 comp.), Bluetooth
(10 comp.), SMS (10 comp.), and SMS (20 comp.).




Figure 6: Twitter average precision vs. detection delay for (left) 
Security and (right) WorldCup 2014. 

4.3 Noise Analysis Since selecting which results
to combine should be especially beneficial in the presence of
inaccurate detectors, we design experiments where we introduce
an increasing number of noisy results into our ensembles.
In particular, we create noisy results by randomly shuffling
the rank lists output by the base detectors and treat them
as additional detector results. Figure 7 shows accuracies
(averaged over 10 independent runs) on all of our datasets for
10 component ensembles (results using 20 components are
similar, and provided in Figure 10 in Appendix B). We notice
that SELECT ensembles provide the most stable and effective
performance under an increasing number of noisy results.
More importantly, these results show that DivE degenerates
quite fast in the presence of noise, i.e., when the assumption
that all results are reasonably accurate fails to hold.
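The noise injection step can be sketched as below. This is an illustrative sketch, not our experimental code: the function name, the choice of which base rank list to shuffle for each noisy result, and the fixed seed are assumptions.

```python
import random


def add_noisy_results(rank_lists, n_noisy, seed=0):
    """Append n_noisy randomly shuffled rank lists to the base detector results.

    rank_lists: list of rank lists, each a permutation of the time points
                as output by one base detector
    """
    rng = random.Random(seed)
    noisy = []
    for _ in range(n_noisy):
        shuffled = list(rng.choice(rank_lists))  # copy one base rank list
        rng.shuffle(shuffled)                    # destroy its ordering
        noisy.append(shuffled)
    return rank_lists + noisy


# Toy example (hypothetical data): two detectors ranking four time points.
base = [[3, 1, 2, 0], [0, 1, 2, 3]]
augmented = add_noisy_results(base, n_noisy=3)
print(len(augmented))
```

Each noisy result is still a valid ranking of the same time points, so it can be fed to the consensus methods exactly like a genuine detector output.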


Figure 7: Ensemble accuracies drop when increasing number of random results are added, where decrease is most prominent for DivE.

4.4 Case Studies In this section we evaluate our ensemble
approach qualitatively using the NYTNews corpus dataset,
for which we do not have a compiled list of ground truth
events. Figure 8 shows the anomaly scores for the 2000-2007
time line, provided by the five base detectors using weighted
degree feature (we have demonstrated a similar figure for
EnronInc. in Figure 1 for additional qualitative analysis).

Top three events by SelectH are marked within boxes in
the figure, and correspond to major events such as the 2001
elections, 9/11 WTC attacks, and the 2003 Columbia Space
Shuttle disaster. SelectH also ranks entities by association
to a detected event for attribution. We note that for the
Columbia disaster, NASA and the seven astronauts killed in
the explosion rank at the top. The visualization of the change
in Figure 9 shows that a heavy clique with high degree nodes
emerges in the graph structure at the time of the event.

Figure 8: Anomaly scores from five base detectors (rows) for NYT
news corpus. Top 3 events by the final ensemble are marked with
green boxes (red bars: top 20 anomalous time points per detector).

Figure 9: During 2003 Columbia disaster a clique of NASA and
the seven killed astronauts emerges from time tick 161 to 162.


5 Conclusion 

In this work we have proposed SELECT, a new selective 
ensemble approach for anomaly mining, and applied it to the 
event detection problem in temporal graphs. SELECT is a 
two-phase approach that combines multiple detector results 
and then multiple consensuses, respectively. Motivated 
by our earlier observations [22] that inaccurate detectors 
may deteriorate overall ensemble accuracy, we designed 
two unsupervised selection strategies, SelectV and SelectH, 
which carefully choose which detector/consensus outcomes 
to assemble. We compared SELECT to Full, the ensemble 
that combines all results, and DivE, an existing ensemble [24] 
that combines diverse, i.e., least correlated results. 

Our quantitative evaluation on real-world datasets with
ground truth shows that building selective ensembles is effective
in boosting detection performance. SelectH appears to
be a better strategy than SelectV, where it either provides the
best result (6/8 in Table 2) or achieves comparable accuracy
when SelectV is the winner. Selecting results based on diversity
turns out to be a poor strategy for anomaly ensembles as
DivE yields even worse results than the Full ensemble (6/8
in Table 2). Noise analysis further corroborates the fact that
DivE selects inaccurate/noisy results for the sake of diversity
and declines in accuracy much faster than the rest.

Future work will investigate how to go beyond binary 
selection and estimate weights for the detector/consensus 
results. We will also apply SELECT to the outlier mining 
problem in multi-dimensional point data and continue to 
enhance it with other detector and consensus methods.

Appendix

A SELECT Base Algorithms for Event Detection 
A.1 Eigen Behavior based Event Detection (EBED).

The multi-variate time series contain the feature values of
each node over time and can be represented as an n × t
data matrix, for n nodes and t time points. EBED
[29] defines sliding time windows of length w over the
series and computes the principal left singular vector of
each n × w matrix W. This vector is the same as the
principal eigenvector of WW^T and is always positive due
to the Perron-Frobenius theorem [34]. Each eigenvector
u(t) is treated as the “eigen-behavior” of the system during
time window t, the entries of which are interpreted as the
“activity” of each node.

To score the time points, EBED computes the similarity
between eigen-behavior u(t) and a summary of past
eigen-behaviors r(t), where r(t) is the arithmetic average of u(t')’s
for t' < t. The anomalousness score of time point t is
then $Z = 1 - u(t) \cdot r(t) \in [0, 1]$, where a high value of Z
indicates a change point. For each anomalous time point t,
EBED performs attribution by computing the relative change
of each node i at t. The higher the relative
change, the more anomalous the node is.
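The scoring step can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the EBED implementation of [29]: the principal left singular vector is approximated by power iteration on WW^T, the first window (which has no history) is given score 0, and the function names are hypothetical. Attribution is omitted.

```python
import math


def principal_left_singular_vector(W, iters=200):
    """Approximate the principal eigenvector of W W^T by power iteration.

    W: non-negative n x w matrix given as a list of n rows.
    """
    n, w = len(W), len(W[0])
    u = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        v = [sum(W[i][j] * u[i] for i in range(n)) for j in range(w)]      # W^T u
        u_new = [sum(W[i][j] * v[j] for j in range(w)) for i in range(n)]  # W v
        norm = math.sqrt(sum(x * x for x in u_new))
        if norm == 0.0:
            return u  # all-zero window: keep the current vector
        u = [x / norm for x in u_new]
    return u


def ebed_scores(X, w):
    """Score Z = 1 - u(t) . r(t) for every sliding window of length w.

    X: n x T data matrix (list of n rows); returns one score per window.
    """
    n, T = len(X), len(X[0])
    past, Z = [], []
    for end in range(w, T + 1):
        u = principal_left_singular_vector([row[end - w:end] for row in X])
        if past:
            # r(t): arithmetic average of all earlier eigen-behaviors.
            r = [sum(p[i] for p in past) / len(past) for i in range(n)]
            z = 1.0 - sum(a * b for a, b in zip(u, r))
            Z.append(min(1.0, max(0.0, z)))  # guard against round-off
        else:
            Z.append(0.0)  # assumption: no history for the first window
        past.append(u)
    return Z


# Toy example (hypothetical data): activity shifts from node 0 to node 1.
X = [[5, 5, 5, 5, 5, 0, 0],
     [1, 1, 1, 1, 1, 9, 9]]
print(ebed_scores(X, w=2))
```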

A.2 Probabilistic Time Series Anomaly Detection (PTSAD).
A common approach to time series anomaly detection
is to probabilistically model a given series and detect
anomalous time points based on their likelihood under the
model. PTSAD models each series with four different parametric
models and performs model selection to identify the
best fit for each series. Our first model is the Poisson, which
is used often for fitting count data. However, Poisson is not
sufficient for sparse series with many zeros. Since real-world
data is frequently characterized by over-dispersion and an excess
number of zeros, we employ a second model called
Zero-Inflated Poisson (ZIP) [32] to account for data sparsity.

We further look for simpler models which fit data with
many zeros and employ the Hurdle models [35]. Rather
than using a single but complex distribution, Hurdle models
assume that the data is generated by two simple, separate
processes: (i) the hurdle and (ii) the count processes.
The hurdle process determines whether there exists activity
at a given time point, and in case of activity the count process
determines the actual (positive) counts. For the hurdle
process, we employ two different models. The first is the
independent Bernoulli and the second is the first order Markov
model, which better captures the dependencies, where an
activity influences the probability of subsequent activities.
For the count process, we use the Zero-Truncated Poisson
(ZTP) [30].
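To make the two-process structure concrete, the sketch below writes out the probability mass of a Bernoulli+ZTP hurdle model and the log-likelihood of a Markov+ZTP hurdle model. It is an illustrative sketch: the parameter names (`pi`, `p01`, `p11`, `lam`) are assumptions, parameters are taken as given rather than fitted, and the Markov likelihood conditions on the first observation.

```python
import math


def ztp_pmf(k, lam):
    """Zero-Truncated Poisson: Poisson(lam) conditioned on k >= 1."""
    if k < 1:
        return 0.0
    return math.exp(-lam) * lam ** k / (math.factorial(k) * (1.0 - math.exp(-lam)))


def hurdle_bernoulli_pmf(k, pi, lam):
    """Bernoulli hurdle with activity probability pi, ZTP counts when active."""
    return 1.0 - pi if k == 0 else pi * ztp_pmf(k, lam)


def hurdle_markov_loglik(series, p01, p11, lam):
    """Log-likelihood under a first-order Markov hurdle with ZTP counts.

    p01: P(active at t | inactive at t-1), p11: P(active at t | active at t-1);
    both are assumed to lie strictly between 0 and 1.
    """
    ll = 0.0
    for prev, cur in zip(series, series[1:]):
        p_active = p11 if prev > 0 else p01
        if cur == 0:
            ll += math.log(1.0 - p_active)
        else:
            ll += math.log(p_active) + math.log(ztp_pmf(cur, lam))
    return ll


# Toy example (hypothetical sparse count series).
counts = [0, 0, 3, 1, 0, 0, 2]
print(hurdle_markov_loglik(counts, p01=0.2, p11=0.6, lam=2.0))
```

When p01 equals p11 the Markov hurdle reduces to the independent Bernoulli hurdle.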

Overall we model each time series with four different
models: Poisson, ZIP, Bernoulli+ZTP and Markov+ZTP. We
then employ Vuong’s likelihood ratio test [36] to select the
best model for individual series. Note that the best-fit model
for each series may be different.

To score the time points, we perform a single-sided test
to compute a p-value for each value x in a given series; i.e.,
$P(X \ge x) = 1 - \mathrm{cdf}_H(x) + \mathrm{pdf}_H(x)$, where H is the
best-fit model for the series. The lower the p-value, the
more anomalous the time point is. We then aggregate all
the p-values from all the series per time point by taking the
normalized sum of the p-values and inverting them to obtain
scores $\in [0, 1]$ (s.t. higher is more anomalous). For each
anomalous time point t, attribution is done by sorting the
nodes (i.e., the series) based on their p-values at t.
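The scoring step can be illustrated with the simplest of the four models. The sketch below assumes every series is fit by a plain Poisson with its maximum-likelihood rate (no model selection), and it interprets "normalized sum and inverting" as one minus the mean p-value across series; both choices, and the function names, are assumptions made for illustration.

```python
import math


def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)


def upper_tail_pvalue(x, lam):
    """P(X >= x) = 1 - cdf(x) + pmf(x) for X ~ Poisson(lam)."""
    cdf = sum(poisson_pmf(k, lam) for k in range(x + 1))
    return min(1.0, max(0.0, 1.0 - cdf + poisson_pmf(x, lam)))


def ptsad_scores(series):
    """Anomaly score per time point from per-series upper-tail p-values.

    series: list of count series (one per node), all of equal length.
    """
    n, T = len(series), len(series[0])
    lams = [sum(s) / T for s in series]  # Poisson MLE rate per series
    scores = []
    for t in range(T):
        p_sum = sum(upper_tail_pvalue(series[i][t], lams[i]) for i in range(n))
        scores.append(1.0 - p_sum / n)   # low p-values -> high score
    return scores


# Toy example (hypothetical counts): both nodes spike at time point 4.
toy = [[1, 2, 1, 0, 9, 1],
       [0, 1, 0, 1, 7, 0]]
print(ptsad_scores(toy))
```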

A.3 Streaming Pattern DIscoveRy in multiple Time-Series
(SPIRIT). SPIRIT [33] can incrementally capture
correlations, discover trends, and dynamically detect change
points in multi-variate time series. The main idea is to represent
the underlying trends of a large number of numerical
streams with a few hidden variables, where the hidden variables
are the projections of the observed streams onto the
principal direction vectors (eigenvectors). These discovered
trends are exploited for detecting change points in the series.

The algorithm starts with a specific number of hidden
variables that capture the main trends of the data. Whenever
the main trends change, new hidden variables are introduced
or several of the existing ones are discarded to capture the
change. SPIRIT can further quantify the change in the individual
time series for attribution through their participation
weights, which are the entries in the principal direction vectors.
For further details on the algorithm, we refer the reader
to the original paper by Papadimitriou et al. [33].

A.4 Anomalous Subspace based Event Detection
(ASED). ASED [31] is based on the separation of the
high-dimensional space occupied by the time series into two
disjoint subspaces, the normal and the anomalous subspaces.
Principal Component Analysis is used to separate the
high-dimensional space, where the major principal components
capture the most variance of the data and hence construct
the normal subspace, and the minor principal components
capture the anomalous subspace. The projections of the time
series data onto these two subspaces reflect the normal and
anomalous behavior. To score the time points, ASED uses
the squared prediction error (SPE) of the residuals in the
anomalous subspace. The residual values associated with
individual series at the anomalous time points are used to
measure the anomalousness of nodes for attribution. For the
specifics of the algorithm, we refer to the original paper by
Lakhina et al. [31].

A.5 Moving Average based Event Detection (MAED).

MAED is a simple approach that calculates the moving
average $\mu_t$ and the moving standard deviation $\sigma_t$ of each
time series corresponding to each node by extending the time
window one point at a time. If the value at a specific time
point is more than three moving standard deviations away
from the mean, then the point is considered as anomalous
and assigned a non-zero score. The anomalousness score is
the difference between the original value and $(\mu_t + 3\sigma_t)$ at t.
To score the time points collectively, MAED aggregates their
scores across all the series. For each anomalous time point t,
attribution is done by sorting the nodes (i.e., the series) based
on the individual scores they assign to t.
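A minimal sketch of MAED is given below. It rests on a few assumptions that the description above leaves open: the moving statistics at t are computed from the values strictly before t, at least two past values are required, only upward deviations are scored (since the score is measured against the upper threshold), and scores are aggregated across series by summation. The function names are hypothetical.

```python
import statistics


def maed_series_scores(x, min_history=2):
    """Per-series MAED scores using an expanding window of past values."""
    scores = [0.0] * len(x)
    for t in range(min_history, len(x)):
        history = x[:t]                      # window grows one point at a time
        mu = statistics.mean(history)
        sigma = statistics.pstdev(history)
        threshold = mu + 3.0 * sigma
        if x[t] > threshold:
            scores[t] = x[t] - threshold     # excess over mu_t + 3 * sigma_t
    return scores


def maed_scores(series):
    """Aggregate per-series scores across all series (here: by summation)."""
    per_series = [maed_series_scores(s) for s in series]
    return [sum(col) for col in zip(*per_series)]


# Toy example (hypothetical data): the first series spikes at time point 5.
toy = [[1, 2, 1, 2, 1, 20, 2],
       [3, 4, 3, 4, 3, 4, 3]]
print(maed_scores(toy))
```

For attribution, the nodes can then be sorted by the per-series scores returned by `maed_series_scores` at the anomalous time point.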

B Additional Evaluation Results 


Table 4: Accuracy of ensembles for RealityMining Voice Call
(directed) (10 components) (features: weighted in-/out-degree).
*: selected detector/consensus results.

| | Full | DivE | SelectV | SelectH |
|---|---|---|---|---|
| EBED (win) | 0.3508 | * | | |
| PTSAD (win) | 0.6284 | | | * |
| SPIRIT (win) | 0.8309 | * | | * |
| ASED (win) | 0.9437 | | * | * |
| MAED (win) | 0.8809 | * | | * |
| EBED (wout) | 0.4122 | * | | |
| PTSAD (wout) | 0.6273 | | | * |
| SPIRIT (wout) | 0.7346 | | * | * |
| ASED (wout) | 0.9500 | | | * |
| MAED (wout) | 0.8758 | | | * |
| Inverse Rank | * 0.7544 | 0.6169 | 0.8880 | * 0.8222 |
| Kemeny-Young | * 0.8221 | * 0.7708 | 0.8619 | * 0.9309 |
| RRA | * 0.8154 | 0.5936 | 0.8901 | * 0.9416 |
| Uni (avg) | * 0.7798 | * 0.6413 | * 0.8370 | * 0.9098 |
| Uni (max) | * 0.6704 | 0.5757 | 0.7786 | * 0.7833 |
| MM (avg) | * 0.9190 | * 0.9162 | 0.8835 | * 0.9183 |
| MM (max) | * 0.4380 | * 0.8934 | 0.7569 | 0.4380 |
| Final Ensemble | 0.7302 | 0.8724 | 0.8370 | 0.9045 |


Table 3: Accuracy of ensembles for EnronInc. (directed) (20
components) (features: weighted in-/out-degree and unweighted
in-/out-degree). * depicts selected detector/consensus results.

| | Full | DivE | SelectV | SelectH |
|---|---|---|---|---|
| EBED (win) | 0.1313 | * | | * |
| PTSAD (win) | 0.1462 | * | | * |
| SPIRIT (win) | 0.7032 | * | | * |
| ASED (win) | 0.5470 | * | * | * |
| MAED (win) | 0.6670 | | | * |
| EBED (wout) | 0.2846 | * | | |
| PTSAD (wout) | 0.2118 | * | | * |
| SPIRIT (wout) | 0.4563 | | | * |
| ASED (wout) | 0.0580 | * | | |
| MAED (wout) | 0.7328 | | * | * |
| EBED (uin) | 0.0892 | * | | |
| PTSAD (uin) | 0.1607 | * | | * |
| SPIRIT (uin) | 0.3996 | * | | * |
| ASED (uin) | 0.1395 | | * | * |
| MAED (uin) | 0.4439 | | * | * |
| EBED (uout) | 0.0225 | * | | |
| PTSAD (uout) | 0.2546 | | | * |
| SPIRIT (uout) | 0.1012 | * | | * |
| ASED (uout) | 0.0870 | * | | |
| MAED (uout) | 0.4181 | | | * |
| Inverse Rank | * 0.7121 | * 0.5660 | 0.6577 | * 0.7496 |
| Kemeny-Young | * 0.3033 | * 0.2495 | 0.5361 | * 0.5066 |
| RRA | * 0.5948 | * 0.5348 | 0.4948 | * 0.5774 |
| Uni (avg) | * 0.4838 | * 0.4325 | * 0.6047 | * 0.5336 |
| Uni (max) | * 0.3020 | * 0.2242 | 0.6633 | * 0.4280 |
| MM (avg) | * 0.5673 | * 0.4662 | 0.6761 | * 0.7217 |
| MM (max) | * 0.0216 | * 0.0216 | * 0.5355 | 0.0222 |
| Final Ensemble | 0.5420 | 0.4697 | 0.7018 | 0.7798 |


Table 5: Accuracy of ensembles for RealityMining Voice Call
(directed) (20 components) (features: weighted in-/out-degree and
unweighted in-/out-degree). *: selected detector/consensus results.

| | Full | DivE | SelectV | SelectH |
|---|---|---|---|---|
| EBED (win) | 0.3508 | * | | |
| PTSAD (win) | 0.6284 | | | * |
| SPIRIT (win) | 0.8309 | | * | * |
| ASED (win) | 0.9437 | | * | * |
| MAED (win) | 0.8809 | * | | * |
| EBED (wout) | 0.4122 | * | | |
| PTSAD (wout) | 0.6273 | | | * |
| SPIRIT (wout) | 0.7346 | | | * |
| ASED (wout) | 0.9500 | | | * |
| MAED (wout) | 0.8758 | | | * |
| EBED (uin) | 0.4173 | | | |
| PTSAD (uin) | 0.8636 | * | | * |
| SPIRIT (uin) | 0.8313 | | | * |
| ASED (uin) | 0.9191 | | | * |
| MAED (uin) | 0.8706 | * | | * |
| EBED (uout) | 0.4800 | | | * |
| PTSAD (uout) | 0.8665 | | | * |
| SPIRIT (uout) | 0.7480 | | | * |
| ASED (uout) | 0.9229 | * | | * |
| MAED (uout) | 0.9115 | | | * |
| Inverse Rank | * 0.8035 | 0.7952 | 0.9240 | * 0.8681 |
| Kemeny-Young | * 0.9064 | 0.9018 | 0.9076 | * 0.9158 |
| RRA | * 0.8866 | * 0.7771 | 0.9013 | * 0.9311 |
| Uni (avg) | * 0.8598 | 0.9192 | * 0.8448 | * 0.9102 |
| Uni (max) | * 0.6844 | * 0.6863 | 0.8517 | * 0.7611 |
| MM (avg) | * 0.9321 | * 0.9083 | * 0.8312 | * 0.9134 |
| MM (max) | * 0.4380 | * 0.8858 | 0.8015 | 0.4380 |
| Final Ensemble | 0.8011 | 0.8335 | 0.8847 | 0.8949 |



Table 6: Accuracy of ensembles for RealityMining Bluetooth
(undirected) (10 components) (features: weighted and unweighted
degree). *: selected detector/consensus results.

| | Full | DivE | SelectV | SelectH |
|---|---|---|---|---|
| EBED (wdeg) | 0.4363 | * | | |
| PTSAD (wdeg) | 0.5820 | * | | * |
| SPIRIT (wdeg) | 0.9499 | * | | * |
| ASED (wdeg) | 0.8601 | | * | * |
| MAED (wdeg) | 0.8359 | * | | * |
| EBED (udeg) | 0.4966 | * | | |
| PTSAD (udeg) | 0.8694 | | * | * |
| SPIRIT (udeg) | 0.9162 | | * | * |
| ASED (udeg) | 0.7662 | | * | * |
| MAED (udeg) | 0.8788 | * | | * |
| Inverse Rank | * 0.8646 | * 0.8255 | 0.8790 | * 0.8538 |
| Kemeny-Young | * 0.9534 | 0.9169 | 0.9698 | * 0.9361 |
| RRA | * 0.9413 | 0.8318 | 0.9693 | * 0.9684 |
| Uni (avg) | * 0.9071 | 0.8654 | * 0.9193 | * 0.9225 |
| Uni (max) | * 0.6973 | * 0.6122 | 0.8270 | * 0.7126 |
| MM (avg) | * 0.9407 | * 0.9340 | 0.8596 | * 0.8892 |
| MM (max) | * 0.6461 | * 0.6374 | 0.8830 | 0.6461 |
| Final Ensemble | 0.8398 | 0.7735 | 0.9193 | 0.8886 |


Table 7: Accuracy of ensembles for RealityMining SMS (directed)
(10 components) (features: weighted in-/out-degree). *: selected
detector/consensus results.

| | Full | DivE | SelectV | SelectH |
|---|---|---|---|---|
| EBED (win) | 0.6117 | * | | |
| PTSAD (win) | 0.7003 | | | * |
| SPIRIT (win) | 0.9256 | | * | * |
| ASED (win) | 0.6338 | * | | * |
| MAED (win) | 0.9002 | * | | * |
| EBED (wout) | 0.5595 | * | | |
| PTSAD (wout) | 0.7023 | | * | * |
| SPIRIT (wout) | 0.8656 | * | | * |
| ASED (wout) | 0.9102 | | * | * |
| MAED (wout) | 0.9259 | | * | * |
| Inverse Rank | * 0.8309 | 0.8174 | 0.8933 | * 0.8044 |
| Kemeny-Young | * 0.9491 | * 0.8779 | 0.9511 | * 0.9386 |
| RRA | * 0.8761 | * 0.8424 | 0.9578 | * 0.9516 |
| Uni (avg) | * 0.8531 | 0.8247 | * 0.9283 | * 0.8684 |
| Uni (max) | * 0.8205 | * 0.7632 | 0.8829 | * 0.8678 |
| MM (avg) | * 0.9276 | * 0.9487 | 0.9492 | * 0.9084 |
| MM (max) | * 0.8907 | * 0.8577 | 0.9410 | 0.9011 |
| Final Ensemble | 0.9092 | 0.8598 | 0.9283 | 0.9217 |


Table 8: Accuracy of ensembles for RealityMining SMS (directed)
(20 components) (features: weighted in-/out-degree and unweighted
in-/out-degree). *: selected detector/consensus results.

| | Full | DivE | SelectV | SelectH |
|---|---|---|---|---|
| EBED (win) | 0.6117 | * | | * |
| PTSAD (win) | 0.7003 | | | * |
| SPIRIT (win) | 0.9256 | | | * |
| ASED (win) | 0.6338 | * | | * |
| MAED (win) | 0.9002 | | | * |
| EBED (wout) | 0.5595 | | | * |
| PTSAD (wout) | 0.7023 | | | |
| SPIRIT (wout) | 0.8656 | | | * |
| ASED (wout) | 0.9102 | | | * |
| MAED (wout) | 0.9259 | | * | * |
| EBED (uin) | 0.4407 | * | | |
| PTSAD (uin) | 0.7809 | * | | * |
| SPIRIT (uin) | 0.7841 | | | * |
| ASED (uin) | 0.6248 | * | | * |
| MAED (uin) | 0.8297 | * | | * |
| EBED (uout) | 0.3246 | | | |
| PTSAD (uout) | 0.9157 | | * | * |
| SPIRIT (uout) | 0.8744 | | | * |
| ASED (uout) | 0.9150 | | | * |
| MAED (uout) | 0.8005 | | | * |
| Inverse Rank | * 0.9135 | 0.6751 | 0.9634 | * 0.9230 |
| Kemeny-Young | * 0.9286 | * 0.7567 | 0.9094 | * 0.9325 |
| RRA | * 0.9568 | 0.6465 | 0.9418 | * 0.9583 |
| Uni (avg) | * 0.8791 | 0.6499 | * 0.9294 | * 0.9156 |
| Uni (max) | * 0.7173 | * 0.6696 | 0.9342 | 0.8650 |
| MM (avg) | * 0.9107 | * 0.8942 | 0.8519 | * 0.9138 |
| MM (max) | * 0.8895 | * 0.8480 | 0.9307 | 0.8895 |
| Final Ensemble | 0.9542 | 0.8749 | 0.9294 | 0.9621 |


Table 9: Accuracy of ensembles for TwitterSecurity (undirected)
(10 components) (features: weighted and unweighted degree). *:
selected detector/consensus results.

| | Full | DivE | SelectV | SelectH |
|---|---|---|---|---|
| EBED (wdeg) | 0.4000 | * | | * |
| PTSAD (wdeg) | 0.5400 | * | | * |
| SPIRIT (wdeg) | 0.4467 | * | * | * |
| ASED (wdeg) | 0.6200 | * | * | * |
| MAED (wdeg) | 0.4933 | | | * |
| EBED (udeg) | 0.4133 | * | * | * |
| PTSAD (udeg) | 0.5467 | | * | * |
| SPIRIT (udeg) | 0.3867 | * | | * |
| ASED (udeg) | 0.5400 | * | | * |
| MAED (udeg) | 0.4533 | * | | |
| Inverse Rank | * 0.4467 | 0.4267 | 0.5133 | * 0.4667 |
| Kemeny-Young | * 0.5667 | 0.5333 | 0.5333 | 0.5800 |
| RRA | * 0.5867 | * 0.5333 | 0.5467 | * 0.5933 |
| Uni (avg) | * 0.5600 | 0.5000 | * 0.5467 | * 0.6000 |
| Uni (max) | * 0.4533 | * 0.4400 | 0.5800 | 0.4533 |
| MM (avg) | * 0.5333 | * 0.5667 | 0.5267 | 0.5600 |
| MM (max) | * 0.3667 | * 0.3667 | 0.5533 | 0.5733 |
| Final Ensemble | 0.5200 | 0.4800 | 0.5467 | 0.5867 |



Table 10: Challenge Network (features: unweighted in-/out-degree).

| Algorithms | Full | DivE | SelectV | SelectH |
|---|---|---|---|---|
| EBED (uin) | 0.1587 | * | | * |
| PTSAD (uin) | 0.5208 | * | | * |
| SPIRIT (uin) | 0.7556 | * | * | * |
| ASED (uin) | 1.0000 | * | * | * |
| MAED (uin) | 1.0000 | | * | * |
| EBED (uout) | 0.7333 | * | | * |
| PTSAD (uout) | 0.0994 | * | | * |
| SPIRIT (uout) | 0.4021 | * | * | * |
| ASED (uout) | 1.0000 | * | * | * |
| MAED (uout) | 1.0000 | * | * | * |
| Inverse Rank | * 1.0000 | * 1.0000 | * 1.0000 | * 1.0000 |
| Kemeny-Young | * 0.9167 | 0.9167 | * 1.0000 | * 0.9167 |
| RRA | * 0.9167 | * 0.9167 | 1.0000 | * 0.9167 |
| Uni (avg) | * 1.0000 | * 1.0000 | * 1.0000 | * 1.0000 |
| Uni (max) | * 0.2778 | * 0.2778 | 1.0000 | 0.2778 |
| MM (avg) | * 1.0000 | * 1.0000 | * 1.0000 | * 1.0000 |
| MM (max) | * 0.7000 | 0.7000 | 1.0000 | * 0.7000 |
| Final Ensemble | 1.0000 | 1.0000 | 1.0000 | 1.0000 |


Table 11: Ensemble results for Challenge Network

| Method | AP | Significance |
|---|---|---|
| Full | 1.0000 | n/a |
| DivE | 1.0000 | n/a |
| (i) Random Ensemble (6/10 + 4/7) | 0.8831 | ±0.1380 |
| SelectV | 1.0000 | = μ + 0.8471σ |
| (ii) Random Ensemble (10/10 + 6/7) | 0.9405 | ±0.0407 |
| SelectH | 1.0000 | = μ + 1.4619σ |


Table 12: Challenge Network: Characterization of time tick 377
(features: unweighted in-/out-degree).

| Algorithms | Full | DivE | SelectV | SelectH |
|---|---|---|---|---|
| EBED (uin) | 1.0000 | * | | * |
| PTSAD (uin) | 0.2000 | * | | * |
| SPIRIT (uin) | 0.5000 | * | | * |
| ASED (uin) | 1.0000 | | * | * |
| MAED (uin) | 1.0000 | | * | * |
| EBED (uout) | 0.2500 | | | * |
| PTSAD (uout) | 0.2000 | | | * |
| SPIRIT (uout) | 0.0270 | * | | |
| ASED (uout) | 1.0000 | * | | * |
| MAED (uout) | 1.0000 | * | * | * |
| Inverse Rank | * 1.0000 | * 1.0000 | * 1.0000 | * 1.0000 |
| Kemeny-Young | * 1.0000 | * 1.0000 | 1.0000 | * 1.0000 |
| RRA | * 1.0000 | * 1.0000 | * 1.0000 | * 1.0000 |
| Uni (avg) | * 1.0000 | * 1.0000 | * 1.0000 | * 1.0000 |
| Uni (max) | * 0.5000 | 0.5000 | 1.0000 | * 0.5000 |
| MM (avg) | * 1.0000 | * 1.0000 | 0.2500 | * 1.0000 |
| MM (max) | * 0.0147 | * 0.0200 | 0.2500 | 0.0147 |
| Final Ensemble | 1.0000 | 1.0000 | 1.0000 | 1.0000 |


Table 13: Challenge Network (features: weighted in-/out-degree
and unweighted in-/out-degree).

| Algorithms | Full | DivE | SelectV | SelectH |
|---|---|---|---|---|
| EBED (win) | 0.0054 | * | | * |
| PTSAD (win) | 0.1183 | * | * | * |
| SPIRIT (win) | 0.0039 | | * | * |
| ASED (win) | 0.0178 | * | | * |
| MAED (win) | 0.0568 | | | * |
| EBED (wout) | 0.0064 | * | | * |
| PTSAD (wout) | 0.6717 | * | * | * |
| SPIRIT (wout) | 0.0036 | | | * |
| ASED (wout) | 0.0104 | * | * | * |
| MAED (wout) | 0.0494 | * | | * |
| EBED (uin) | 0.1587 | * | | * |
| PTSAD (uin) | 0.5208 | * | * | * |
| SPIRIT (uin) | 0.7556 | * | | * |
| ASED (uin) | 1.0000 | * | | * |
| MAED (uin) | 1.0000 | | | * |
| EBED (uout) | 0.7333 | * | | * |
| PTSAD (uout) | 0.0994 | * | | * |
| SPIRIT (uout) | 0.4021 | * | | * |
| ASED (uout) | 1.0000 | * | | * |
| MAED (uout) | 1.0000 | * | | * |
| Inverse Rank | * 0.8333 | * 0.8333 | * 0.4712 | * 0.8333 |
| Kemeny-Young | * 0.4720 | * 0.5167 | * 0.1730 | * 0.4720 |
| RRA | * 0.8095 | * 0.7778 | * 0.3827 | * 0.8095 |
| Uni (avg) | * 0.5000 | * 0.6250 | * 0.0956 | * 0.0339 |
| Uni (max) | * 0.2778 | * 0.2778 | * 0.3503 | * 0.0219 |
| MM (avg) | * 0.0244 | * 0.3432 | * 0.0093 | * 0.0244 |
| MM (max) | * 0.0151 | * 0.0151 | * 0.0499 | 0.0151 |
| Final Ensemble | 0.6325 | 0.6667 | 0.3575 | 0.7500 |


Table 14: Ensemble results for Challenge Network

| Method | AP | Significance |
|---|---|---|
| Full | 0.6325 | n/a |
| DivE | 0.6667 | n/a |
| (i) Random Ensemble (5/20 + 7/7) | 0.5975 | ±0.2599 |
| SelectV | 0.3575 | = μ − 0.9234σ |
| (ii) Random Ensemble (20/20 + 6/7) | 0.6128 | ±0.0881 |
| SelectH | 0.7500 | = μ + 1.5573σ |



Table 15: Challenge Network: Characterization of time tick 377
(features: weighted in-/out-degree and unweighted in-/out-degree).

| Algorithms | Full | DivE | SelectV | SelectH |
|---|---|---|---|---|
| EBED (win) | 0.0909 | * | * | * |
| PTSAD (win) | 1.0000 | * | | * |
| SPIRIT (win) | 0.0238 | | | |
| ASED (win) | 0.3333 | | * | * |
| MAED (win) | 1.0000 | * | | * |
| EBED (wout) | 0.0385 | | | |
| PTSAD (wout) | 0.1429 | * | | * |
| SPIRIT (wout) | 0.0133 | * | | |
| ASED (wout) | 0.2500 | * | * | * |
| MAED (wout) | 1.0000 | * | | * |
| EBED (uin) | 1.0000 | * | | * |
| PTSAD (uin) | 0.2000 | * | * | * |
| SPIRIT (uin) | 0.5000 | * | | |
| ASED (uin) | 1.0000 | | | * |
| MAED (uin) | 1.0000 | | * | * |
| EBED (uout) | 0.2500 | | * | * |
| PTSAD (uout) | 0.2000 | | | * |
| SPIRIT (uout) | 0.0270 | * | | |
| ASED (uout) | 1.0000 | * | | * |
| MAED (uout) | 1.0000 | | | * |
| Inverse Rank | * 1.0000 | * 1.0000 | 1.0000 | * 1.0000 |
| Kemeny-Young | * 1.0000 | * 1.0000 | * 1.0000 | * 1.0000 |
| RRA | * 1.0000 | * 1.0000 | 1.0000 | * 1.0000 |
| Uni (avg) | * 1.0000 | * 1.0000 | * 1.0000 | * 1.0000 |
| Uni (max) | * 0.1667 | * 0.2000 | 0.5000 | 0.3333 |
| MM (avg) | * 1.0000 | 1.0000 | 1.0000 | * 1.0000 |
| MM (max) | * 0.0128 | * 0.0128 | 0.0189 | 0.0135 |
| Final Ensemble | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
Figure 10: Analysis of accuracy when increasing number of random
base results are introduced for ensembles with 20 components:
(a) EnronInc., (b) RealityMining Voice Call, (c) RealityMining SMS.
Decline in accuracy under noise is most prominent for DivE.









































